Abstract:Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pok'{e}mon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision-making performance within a single complex environment, and (2) their ability to self-evolving through accumulated experience. We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG-Bench will facilitate future research on harness-aware and self-evolving agents in realistic interactive environments.
Abstract:Automating repository-level software engineering tasks is a foundational challenge for autonomous code agents, largely due to the difficulty of configuring executable environments. However, manual configuration remains a labor-intensive bottleneck, necessitating a transition toward fully automated environment configuration. Existing approaches often rely on pre-defined artifacts or are restricted to specific programming languages, limiting their applicability to real-world repositories. In this paper, we first propose RAT (RunAnyThing), a language-agnostic framework for automated environment configuration on arbitrary repositories. RAT features a multi-stage pipeline that integrates semantic initialization, a planning mechanism, specialized toolset, and a robust sandbox for configuration. Furthermore, to enable rigorous evaluation, we propose RATBench, a benchmark that reflects the the distribution and heterogeneity of real-world repositories. Extensive experiments demonstrate that RAT achieves state-of-the-art performance, improving the Environment Setup Success Rate (ESSR) by an average of 29.6% over strong baselines.